R3: Population Stratification - Continuous Ancestry Analysis¶
Reviewer Question¶
Referee #3: "How do you address population stratification? Are signature effects binary (ancestry group) or continuous?"
Why This Matters¶
Understanding whether ancestry effects are binary (discrete groups) or continuous (graded by ancestry probability) is critical for:
- Model interpretation: Do signatures capture continuous genetic variation?
- Clinical translation: Can we use ancestry probability to refine predictions?
- Biological plausibility: Continuous effects suggest polygenic architecture
Our Approach¶
We analyze signature loadings as a continuous function of ancestry probability (pred), avoiding harsh thresholding:
- Ancestry Probability: Use Random Forest prediction probability (
pred, 0-1) instead of binary ancestry calls - Reference Baseline: Use population reference theta (same as PC analysis) to calculate deviations
- Temporal Preservation: Calculate deviations at each time point, then average (preserves temporal relationships)
- Focus Signatures: Analyze signatures of interest (SAS: sig 5, EAS: sig 15) plus most variable signatures
Key Findings¶
✅ Ancestry effects are continuous, not binary
✅ Signature 5 increases continuously with SAS ancestry probability (mean deviation: 0.015 → 0.036)
✅ Signature 15 increases continuously with EAS ancestry probability
✅ Variance patterns explained by sample size (larger samples at high pred show full distribution)
1. Setup and Load Data¶
%load_ext autoreload
%autoreload 2
import os
import sys
sys.path.append('/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/')
from assemble_e_theta import *
new_thetas_no_PC = assemble_new_model_with_no_pcs()
ASSEMBLING NEW MODEL WITH PCS FROM ALL BATCHES ============================================================ Loading batch 0-10000... ✅ Loaded: (10000, 21, 52) Loading batch 10000-20000... ✅ Loaded: (10000, 21, 52) Loading batch 20000-30000...
/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/assemble_e_theta.py:36: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. model = torch.load(filepath, map_location='cpu')
✅ Loaded: (10000, 21, 52) Loading batch 30000-40000... ✅ Loaded: (10000, 21, 52) Loading batch 40000-50000... ✅ Loaded: (10000, 21, 52) Loading batch 50000-60000... ✅ Loaded: (10000, 21, 52) Loading batch 60000-70000... ✅ Loaded: (10000, 21, 52) Loading batch 70000-80000... ✅ Loaded: (10000, 21, 52) Loading batch 80000-90000... ✅ Loaded: (10000, 21, 52) Loading batch 90000-100000... ✅ Loaded: (10000, 21, 52) Loading batch 100000-110000... ✅ Loaded: (10000, 21, 52) Loading batch 110000-120000... ✅ Loaded: (10000, 21, 52) Loading batch 120000-130000... ✅ Loaded: (10000, 21, 52) Loading batch 130000-140000... ✅ Loaded: (10000, 21, 52) Loading batch 140000-150000... ✅ Loaded: (10000, 21, 52) Loading batch 150000-160000... ✅ Loaded: (10000, 21, 52) Loading batch 160000-170000... ✅ Loaded: (10000, 21, 52) Loading batch 170000-180000... ✅ Loaded: (10000, 21, 52) Loading batch 180000-190000... ✅ Loaded: (10000, 21, 52) Loading batch 190000-200000... ✅ Loaded: (10000, 21, 52) Loading batch 200000-210000... ✅ Loaded: (10000, 21, 52) Loading batch 210000-220000... ✅ Loaded: (10000, 21, 52) Loading batch 220000-230000... ✅ Loaded: (10000, 21, 52) Loading batch 230000-240000... ✅ Loaded: (10000, 21, 52) Loading batch 240000-250000... ✅ Loaded: (10000, 21, 52) Loading batch 250000-260000... ✅ Loaded: (10000, 21, 52) Loading batch 260000-270000... ✅ Loaded: (10000, 21, 52) Loading batch 270000-280000... ✅ Loaded: (10000, 21, 52) Loading batch 280000-290000... ✅ Loaded: (10000, 21, 52) Loading batch 290000-300000... ✅ Loaded: (10000, 21, 52) Loading batch 300000-310000... ✅ Loaded: (10000, 21, 52) Loading batch 310000-320000... ✅ Loaded: (10000, 21, 52) Loading batch 320000-330000... ✅ Loaded: (10000, 21, 52) Loading batch 330000-340000... ✅ Loaded: (10000, 21, 52) Loading batch 340000-350000... ✅ Loaded: (10000, 21, 52) Loading batch 350000-360000... ✅ Loaded: (10000, 21, 52) Loading batch 360000-370000... ✅ Loaded: (10000, 21, 52) Loading batch 370000-380000... ✅ Loaded: (10000, 21, 52) Loading batch 380000-390000... ✅ Loaded: (10000, 21, 52) Loading batch 390000-400000... ✅ Loaded: (10000, 21, 52) Concatenating 40 batches... ✅ Combined shape: (400000, 21, 52) Applying softmax normalization... ✅ Softmax applied: (400000, 21, 52) Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_pcs_retrospective_correct_noPC.pt ✅ Saved successfully! Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_lambdas_with_pcs_retrospective_correctE_noPC.pt ✅ Saved successfully!
new_thetas_with_pcs = assemble_new_model_with_pcs()
ASSEMBLING NEW MODEL WITH PCS FROM ALL BATCHES ============================================================ Loading batch 0-10000... ✅ Loaded: (10000, 21, 52) Loading batch 10000-20000...
/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/helper_py/assemble_e_theta.py:97: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. model = torch.load(filepath, map_location='cpu')
✅ Loaded: (10000, 21, 52) Loading batch 20000-30000... ✅ Loaded: (10000, 21, 52) Loading batch 30000-40000... ✅ Loaded: (10000, 21, 52) Loading batch 40000-50000... ✅ Loaded: (10000, 21, 52) Loading batch 50000-60000... ✅ Loaded: (10000, 21, 52) Loading batch 60000-70000... ✅ Loaded: (10000, 21, 52) Loading batch 70000-80000... ✅ Loaded: (10000, 21, 52) Loading batch 80000-90000... ✅ Loaded: (10000, 21, 52) Loading batch 90000-100000... ✅ Loaded: (10000, 21, 52) Loading batch 100000-110000... ✅ Loaded: (10000, 21, 52) Loading batch 110000-120000... ✅ Loaded: (10000, 21, 52) Loading batch 120000-130000... ✅ Loaded: (10000, 21, 52) Loading batch 130000-140000... ✅ Loaded: (10000, 21, 52) Loading batch 140000-150000... ✅ Loaded: (10000, 21, 52) Loading batch 150000-160000... ✅ Loaded: (10000, 21, 52) Loading batch 160000-170000... ✅ Loaded: (10000, 21, 52) Loading batch 170000-180000... ✅ Loaded: (10000, 21, 52) Loading batch 180000-190000... ✅ Loaded: (10000, 21, 52) Loading batch 190000-200000... ✅ Loaded: (10000, 21, 52) Loading batch 200000-210000... ✅ Loaded: (10000, 21, 52) Loading batch 210000-220000... ✅ Loaded: (10000, 21, 52) Loading batch 220000-230000... ✅ Loaded: (10000, 21, 52) Loading batch 230000-240000... ✅ Loaded: (10000, 21, 52) Loading batch 240000-250000... ✅ Loaded: (10000, 21, 52) Loading batch 250000-260000... ✅ Loaded: (10000, 21, 52) Loading batch 260000-270000... ✅ Loaded: (10000, 21, 52) Loading batch 270000-280000... ✅ Loaded: (10000, 21, 52) Loading batch 280000-290000... ✅ Loaded: (10000, 21, 52) Loading batch 290000-300000... ✅ Loaded: (10000, 21, 52) Loading batch 300000-310000... ✅ Loaded: (10000, 21, 52) Loading batch 310000-320000... ✅ Loaded: (10000, 21, 52) Loading batch 320000-330000... ✅ Loaded: (10000, 21, 52) Loading batch 330000-340000... ✅ Loaded: (10000, 21, 52) Loading batch 340000-350000... ✅ Loaded: (10000, 21, 52) Loading batch 350000-360000... ✅ Loaded: (10000, 21, 52) Loading batch 360000-370000... ✅ Loaded: (10000, 21, 52) Loading batch 370000-380000... ✅ Loaded: (10000, 21, 52) Loading batch 380000-390000... ✅ Loaded: (10000, 21, 52) Loading batch 390000-400000... ✅ Loaded: (10000, 21, 52) Concatenating 40 batches... ✅ Combined shape: (400000, 21, 52) Applying softmax normalization... ✅ Softmax applied: (400000, 21, 52) Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_pcs_retrospective_correctE.pt ✅ Saved successfully! Saving to: /Users/sarahurbut/aladynoulli2/pyScripts/new_lambdas_with_pcs_retrospective_correctE.pt ✅ Saved successfully!
================================================================================ SIGNATURE LOADINGS BY ANCESTRY - CONTINUOUS ANALYSIS ================================================================================
2. Load Ancestry Data¶
1. Loading ancestry data... ✓ Loaded 498,395 rows Columns: ['eid', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'knn', 'pop', 'rf', 'pred', 'rf99', 'rf90', 'rf80'] ✓ 498,395 patients with ancestry predictions
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/2145359146.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. ancestry_df = pd.read_csv(ANCESTRY_PATH, sep='\t')
| eid | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | knn | pop | rf | pred | rf99 | rf90 | rf80 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | -0.164125 | -0.016342 | -0.005988 | -0.002362 | -0.001290 | -0.000793 | -0.000692 | -0.002821 | -0.000928 | -0.001011 | AFR | UKB | AFR | 1.000 | AFR | AFR | AFR |
| 1 | -10 | 0.083763 | -0.125806 | -0.026856 | -0.032077 | -0.008137 | -0.001927 | -0.000891 | -0.019059 | -0.003229 | -0.003247 | EUR | UKB | EUR | 0.999 | EUR | EUR | EUR |
| 2 | -100 | 0.082869 | -0.128098 | -0.026586 | -0.032438 | -0.003435 | -0.001640 | -0.001252 | -0.013836 | -0.003301 | -0.001948 | EUR | UKB | EUR | 0.986 | NaN | EUR | EUR |
| 3 | -101 | 0.085578 | -0.127146 | -0.025148 | -0.028368 | -0.008648 | -0.003687 | 0.000004 | -0.021754 | -0.000768 | -0.006615 | EUR | UKB | EUR | 0.998 | EUR | EUR | EUR |
| 4 | -102 | 0.083711 | -0.126064 | -0.024218 | -0.029771 | -0.009118 | -0.000895 | 0.001435 | -0.015870 | -0.003183 | -0.007149 | EUR | UKB | EUR | 0.995 | EUR | EUR | EUR |
3. Load Signature Loadings (Theta)¶
# Load signature loadings (theta) - using thetas with PCs
print("\n2. Loading signature loadings (with PCs)...")
thetas = torch.load(THETA_PATH, map_location='cpu')
if hasattr(thetas, 'numpy'):
thetas = thetas.numpy()
elif isinstance(thetas, torch.Tensor):
thetas = thetas.numpy()
print(f" ✓ Loaded theta shape: {thetas.shape} (patients × signatures × time)")
# Calculate average signature loadings per patient (across time)
print("\n3. Calculating average signature loadings per patient...")
avg_signature_loadings = thetas.mean(axis=2) # Average across time dimension
print(f" ✓ Average signature loadings shape: {avg_signature_loadings.shape} (patients × signatures)")
2. Loading signature loadings (with PCs)...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/1498240516.py:3: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. thetas = torch.load(THETA_PATH, map_location='cpu')
✓ Loaded theta shape: (400000, 21, 52) (patients × signatures × time) 3. Calculating average signature loadings per patient... ✓ Average signature loadings shape: (400000, 21) (patients × signatures)
4. Match Ancestry Data to Signature Loadings¶
4. Matching ancestry data to signature loadings... ✓ Loaded processed_ids: 400,000 ✓ Matched 400,000 patients (100.0%) Ancestry distribution:
EUR 378921 SAS 8295 AFR 7616 AMR 3367 EAS 1801 Name: count, dtype: int64
5. Load Reference Theta and Calculate Deviations¶
Important: We use the same reference theta as the PC analysis (from reference_thetas.csv). This is the population reference trajectory, not ancestry-specific.
Calculation order:
- Calculate deviation at each time point:
deviation_t = patient_signature_t - reference_signature_t - Then average deviations across time:
mean_deviation = mean(deviation_t)
This preserves the temporal relationship between patient and reference trajectories, rather than averaging first and then subtracting.
This allows us to see if ancestry-specific effects (e.g., SAS high sig 5, EAS high sig 15) increase continuously with ancestry probability.
# Load reference theta (same as PC analysis)
print("\n5. Loading reference theta...")
try:
ref_theta = pd.read_csv(REFERENCE_THETA_PATH, header=0).values # Shape: (K, T_ref)
T = thetas.shape[2] # Number of timepoints
if ref_theta.shape[1] >= T:
ref_slice = ref_theta[:, -T:] # Use last T columns
else:
pad = np.zeros((ref_theta.shape[0], T - ref_theta.shape[1]))
ref_slice = np.concatenate([ref_theta, pad], axis=1)
print(f" ✓ Loaded reference theta shape: {ref_slice.shape} (signatures × time)")
except Exception as e:
print(f" ⚠️ Could not load reference theta: {e}")
print(" Using zeros as reference (will show raw signature loadings)")
ref_slice = np.zeros((thetas.shape[1], thetas.shape[2]))
# Calculate deviations at each time point, then average
print("\n6. Calculating deviations from reference theta...")
print(" Step 1: Calculate deviation at each time point...")
# Deviations at each time: (matched_patients, signatures, time)
deviations_at_t = thetas[matched_indices] - ref_slice[np.newaxis, :, :] # Broadcasting: (N, K, T) - (1, K, T)
print(f" ✓ Deviations at each time point shape: {deviations_at_t.shape} (patients × signatures × time)")
print(" Step 2: Average deviations across time...")
# Average deviations across time
deviations = deviations_at_t.mean(axis=2) # Shape: (matched_patients, signatures)
print(f" ✓ Average deviations shape: {deviations.shape} (patients × signatures)")
print("\n Note: Deviations = mean(patient_signature_t - reference_signature_t) across time")
print(" This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.")
5. Loading reference theta... ✓ Loaded reference theta shape: (21, 52) (signatures × time) 6. Calculating deviations from reference theta... Step 1: Calculate deviation at each time point... ✓ Deviations at each time point shape: (400000, 21, 52) (patients × signatures × time) Step 2: Average deviations across time... ✓ Average deviations shape: (400000, 21) (patients × signatures) Note: Deviations = mean(patient_signature_t - reference_signature_t) across time This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.
6. Focus on Signatures of Interest¶
Based on PC analysis findings:
- SAS: High sig 5
- EAS: High sig 15
We'll plot these specific signatures, plus identify the most variable signatures overall.
7. Identifying signatures of interest...
✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
✓ Top 5 most variable signatures: [15 5 19 18 6]
✓ All signatures to plot: [5, 6, 15, 18, 19]
Top 5 most variable signatures:
| signature | f_statistic | p_value | variance | |
|---|---|---|---|---|
| 15 | 15 | 42626.210969 | 0.0 | 1.615098e-05 |
| 5 | 5 | 5719.227300 | 0.0 | 2.834223e-04 |
| 19 | 19 | 2870.763591 | 0.0 | 6.388250e-06 |
| 18 | 18 | 2564.017708 | 0.0 | 8.834627e-07 |
| 6 | 6 | 2225.663208 | 0.0 | 6.262865e-07 |
7. PCA on Signature Loadings Colored by Ancestry¶
Purpose: Perform PCA decomposition on average signature loadings over time, then plot PC1 vs PC2 colored by ancestry. This shows whether ancestries are separated in the signature loading space.
Key Point: This demonstrates that PCs don't change findings much except for SAS where it's a boost.
# ============================================================================
# PCA on Signature Loadings: PC1 vs PC2 Colored by Ancestry
# ============================================================================
print("="*80)
print("PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry")
print("="*80)
from sklearn.decomposition import PCA
# Use average signature loadings over time (not deviations)
# Calculate from thetas: average across time dimension
print("\n1. Calculating average signature loadings over time...")
avg_signature_loadings = thetas[matched_indices].mean(axis=2) # (matched_patients, signatures)
print(f" ✓ Average signature loadings shape: {avg_signature_loadings.shape}")
# Get ancestry labels for matched patients
print("\n2. Getting ancestry labels...")
matched_ancestry_labels = ancestry_df.iloc[matched_indices]['rf80'].values
# Handle NaN values
population_labels = []
for p in matched_ancestry_labels:
if pd.isna(p) or (isinstance(p, float) and np.isnan(p)):
population_labels.append('Unknown')
else:
population_labels.append(str(p))
population_labels = np.array(population_labels)
# Filter to valid populations only
valid_pop_mask = np.isin(population_labels, ['AFR', 'EAS', 'SAS', 'EUR'])
if valid_pop_mask.sum() < len(population_labels):
print(f" Filtering out non-population labels...")
print(f" Before: {len(population_labels):,} patients")
avg_signature_loadings = avg_signature_loadings[valid_pop_mask]
population_labels = population_labels[valid_pop_mask]
print(f" After: {len(population_labels):,} patients")
# Check population counts
unique_pops = np.unique(population_labels)
print(f"\n Population counts:")
for pop in unique_pops:
count = (population_labels == pop).sum()
print(f" {pop}: {count:,}")
# Perform PCA on signature loadings
print("\n3. Performing PCA on average signature loadings...")
pca = PCA(n_components=2)
pca_result = pca.fit_transform(avg_signature_loadings)
print(f" ✓ PCA result shape: {pca_result.shape}")
print(f" ✓ Explained variance ratio:")
print(f" PC1: {pca.explained_variance_ratio_[0]:.4f} ({pca.explained_variance_ratio_[0]*100:.2f}%)")
print(f" PC2: {pca.explained_variance_ratio_[1]:.4f} ({pca.explained_variance_ratio_[1]*100:.2f}%)")
print(f" Total: {pca.explained_variance_ratio_.sum():.4f} ({pca.explained_variance_ratio_.sum()*100:.2f}%)")
# Population colors
pop_colors = {
'AFR': '#1f77b4', # Blue
'EAS': '#ff7f0e', # Orange
'SAS': '#2ca02c', # Green
'EUR': '#d62728', # Red
}
# Create plot
print("\n4. Creating PC1 vs PC2 plot colored by ancestry...")
fig, ax = plt.subplots(figsize=(12, 10))
# Plot each population separately
for pop in unique_pops:
pop_mask = population_labels == pop
if pop_mask.sum() > 0:
color = pop_colors.get(pop, '#808080')
ax.scatter(pca_result[pop_mask, 0],
pca_result[pop_mask, 1],
c=color,
label=f'{pop} (n={pop_mask.sum():,})',
alpha=0.6,
s=20,
edgecolors='none')
# Add reference lines
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.3)
ax.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.3)
# Labels and title
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)',
fontsize=13, fontweight='bold')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)',
fontsize=13, fontweight='bold')
ax.set_title('PCA on Signature Loadings: PC1 vs PC2\nColored by Ancestry',
fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='best', fontsize=11, framealpha=0.9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
# Save figure
fig_path = OUTPUT_DIR / 'pca_signature_loadings_by_ancestry.png'
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
print(f"\n✓ Saved PCA plot to: {fig_path}")
plt.show()
# Print summary statistics by population
print("\n" + "="*80)
print("SUMMARY: Mean PC1 and PC2 by Population")
print("="*80)
for pop in ['AFR', 'EAS', 'SAS', 'EUR']:
pop_mask = population_labels == pop
if pop_mask.sum() > 0:
mean_pc1 = pca_result[pop_mask, 0].mean()
std_pc1 = pca_result[pop_mask, 0].std()
mean_pc2 = pca_result[pop_mask, 1].mean()
std_pc2 = pca_result[pop_mask, 1].std()
print(f"\n{pop} (n={pop_mask.sum():,}):")
print(f" PC1: {mean_pc1:+.4f} ± {std_pc1:.4f}")
print(f" PC2: {mean_pc2:+.4f} ± {std_pc2:.4f}")
print("\n" + "="*80)
print("✓ PCA on signature loadings complete!")
print("="*80)
================================================================================
PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry
================================================================================
1. Calculating average signature loadings over time...
✓ Average signature loadings shape: (400000, 21)
2. Getting ancestry labels...
Filtering out non-population labels...
Before: 400,000 patients
After: 388,979 patients
Population counts:
AFR: 7,012
EAS: 1,897
EUR: 372,463
SAS: 7,607
3. Performing PCA on average signature loadings...
✓ PCA result shape: (388979, 2)
✓ Explained variance ratio:
PC1: 0.2683 (26.83%)
PC2: 0.1902 (19.02%)
Total: 0.4586 (45.86%)
4. Creating PC1 vs PC2 plot colored by ancestry...
✓ Saved PCA plot to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/pca_signature_loadings_by_ancestry.png
================================================================================ SUMMARY: Mean PC1 and PC2 by Population ================================================================================ AFR (n=7,012): PC1: -0.0004 ± 0.0281 PC2: -0.0006 ± 0.0239 EAS (n=1,897): PC1: -0.0012 ± 0.0281 PC2: -0.0000 ± 0.0232 SAS (n=7,607): PC1: -0.0003 ± 0.0283 PC2: -0.0003 ± 0.0237 EUR (n=372,463): PC1: +0.0000 ± 0.0284 PC2: +0.0000 ± 0.0239 ================================================================================ ✓ PCA on signature loadings complete! ================================================================================
# Create visualizations for each ancestry
print("\n8. Creating visualizations...")
ancestries_to_plot = ['AFR', 'EAS', 'EUR', 'SAS']
for ancestry in ancestries_to_plot:
print(f"\n Creating figure for {ancestry}...")
# Filter to this ancestry
ancestry_mask = matched_ancestry == ancestry
ancestry_pred = matched_pred[ancestry_mask]
ancestry_deviations = deviations[ancestry_mask]
if ancestry_mask.sum() < 10:
print(f" ⚠️ Skipping {ancestry}: only {ancestry_mask.sum()} patients")
continue
# Get signatures to plot for this ancestry
sigs_to_plot = []
if ancestry in signatures_of_interest:
sigs_to_plot.extend(signatures_of_interest[ancestry])
# Add top variable signatures
sigs_to_plot.extend(top_5_signatures[:5])
sigs_to_plot = sorted(list(set(sigs_to_plot)))[:5] # Limit to 5, remove duplicates
# Create figure with subplots (one per signature)
n_sigs = len(sigs_to_plot)
fig, axes = plt.subplots(1, n_sigs, figsize=(4*n_sigs, 4))
if n_sigs == 1:
axes = [axes]
for plot_idx, sig_idx in enumerate(sigs_to_plot):
ax = axes[plot_idx]
# Get deviations for this signature
sig_deviations = ancestry_deviations[:, sig_idx]
# Create scatter plot with smooth trend line
ax.scatter(ancestry_pred, sig_deviations, alpha=0.3, s=10, color='steelblue')
# Add LOESS/smooth trend line
# Sort by pred for smooth line
sort_idx = np.argsort(ancestry_pred)
sorted_pred = ancestry_pred[sort_idx]
sorted_deviations = sig_deviations[sort_idx]
# Fit smooth curve
try:
spline = UnivariateSpline(sorted_pred, sorted_deviations, s=len(sorted_pred)*0.1)
pred_smooth = np.linspace(sorted_pred.min(), sorted_pred.max(), 100)
deviations_smooth = spline(pred_smooth)
ax.plot(pred_smooth, deviations_smooth, 'r-', linewidth=2, label='Trend')
except:
# Fallback to simple moving average
window_size = max(10, len(sorted_pred) // 20)
if len(sorted_pred) > window_size:
pred_smooth = sorted_pred[window_size//2:-window_size//2]
deviations_smooth = np.convolve(sorted_deviations, np.ones(window_size)/window_size, mode='valid')
ax.plot(pred_smooth, deviations_smooth, 'r-', linewidth=2, label='Trend')
# Add horizontal line at y=0 (reference)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
# Highlight if this is a signature of interest
title = f'Signature {sig_idx}'
if ancestry in signatures_of_interest and sig_idx in signatures_of_interest[ancestry]:
title += ' ⭐'
ax.set_xlabel('Ancestry Probability (pred)', fontsize=10, fontweight='bold')
ax.set_ylabel('Deviation from Reference', fontsize=10, fontweight='bold')
ax.set_title(title, fontsize=11, fontweight='bold')
ax.grid(alpha=0.3)
ax.legend(fontsize=8)
plt.suptitle(f'Signature Loadings vs. Ancestry Probability: {ancestry} (n={ancestry_mask.sum():,})',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
# Save figure
fig_path = OUTPUT_DIR / f'signature_loadings_by_ancestry_{ancestry}.png'
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
print(f" ✓ Saved to: {fig_path}")
plt.show()
print("\n" + "="*80)
print("ANALYSIS COMPLETE")
print("="*80)
print(f"\n✓ Created 4 figures (one per ancestry)")
print(f"✓ Analyzed {len(matched_indices):,} patients")
print(f"✓ Signatures of interest: {signatures_of_interest}")
print(f"✓ Top 5 most variable signatures: {top_5_signatures}")
print(f"\nOutput directory: {OUTPUT_DIR}")
8. Creating visualizations... Creating figure for AFR... ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_AFR.png
Creating figure for EAS...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/1409193948.py:77: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
plt.tight_layout()
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_20723/1409193948.py:81: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EAS.png
/opt/miniconda3/envs/new_env_pyro2/lib/python3.9/site-packages/IPython/core/pylabtools.py:152: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
fig.canvas.print_figure(bytes_io, **kw)
Creating figure for EUR... ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EUR.png
Creating figure for SAS... ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_SAS.png
================================================================================
ANALYSIS COMPLETE
================================================================================
✓ Created 4 figures (one per ancestry)
✓ Analyzed 400,000 patients
✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
✓ Top 5 most variable signatures: [15 5 19 18 6]
Output directory: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis
9. WITH vs WITHOUT PCs Comparison¶
Key Question: Does PC adjustment change biological interpretations?
Answer: No - PC adjustment controls for population stratification while preserving biological signal.
This section compares signature-disease associations (φ) and patient signature loadings (θ) between models trained WITH and WITHOUT PC adjustment.
================================================================================ PHI COMPARISON: WITH PCs vs WITHOUT PCs ================================================================================ This analysis shows that PC adjustment: • Controls for population stratification • Preserves biological signal (high correlation) • Does NOT drastically change disease-signature associations 📊 PHI COMPARISON RESULTS: Total points: 380,016 Correlation: 1.000000 Mean absolute difference: 0.000078 ✓ High correlation indicates PC adjustment preserves biological signal
9b. Trajectory Deviations: WITH vs WITHOUT PCs¶
Compare signature trajectory deviations by ancestry for models WITH and WITHOUT PC adjustment. This shows how PC adjustment affects signature loadings while preserving overall patterns.
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/924814540.py:9: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. thetas_nopcs = torch.load(thetas_nopcs_path, map_location='cpu') /var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/924814540.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. thetas_withpcs = torch.load(thetas_withpcs_path, map_location='cpu') /var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/924814540.py:40: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False. ancestry_df = pd.read_csv(ANCESTRY_PATH, sep='\t')
✅ Trajectory comparison complete Key observation: Patterns are preserved WITH vs WITHOUT PCs PC adjustment controls stratification without changing biological interpretations
9c. PC-Induced Shift Heatmap: Which Signatures Shift Most?¶
This heatmap shows which signatures shift most with PC adjustment, justifying why we focus on signatures 5 and 15 for ancestry-specific analyses.
Showing PC-induced shift heatmap for SAS (Shift magnitude: 0.0344)
📊 SIGNATURES WITH LARGEST PC-INDUCED SHIFTS (SAS): 1. Signature 5: max |Δ| = 0.0344 at time 35 (Δ = +0.0344) 2. Signature 15: max |Δ| = 0.0118 at time 35 (Δ = +0.0118) 3. Signature 17: max |Δ| = 0.0089 at time 34 (Δ = -0.0089) 4. Signature 3: max |Δ| = 0.0084 at time 0 (Δ = -0.0084) 5. Signature 7: max |Δ| = 0.0083 at time 4 (Δ = +0.0083) ✅ Key finding: Signatures 5 and 15 show largest PC-induced shifts This justifies focusing on these signatures for ancestry-specific analyses
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_55755/485084034.py:28: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
ancestry = pd.read_csv(ancestry_path, sep='\t', usecols=['eid', 'rf80']).drop_duplicates('eid')
Summary and Response¶
Key Findings¶
Ancestry effects are continuous, not binary
- Signature 5 (SAS): Mean deviation increases from ~0.015 to ~0.036 as ancestry probability increases from 0.3 to 0.95
- Signature 15 (EAS): Similar continuous increase with ancestry probability
- Variance patterns: Explained by sample size (larger samples at high
predshow full distribution)
PC adjustment preserves biological signal
- Phi (signature-disease associations): Correlation >0.99 between WITH and WITHOUT PCs
- Mean difference <0.002 indicates minimal impact on disease-signature associations
- Trajectory patterns preserved: Signature deviations by ancestry are similar WITH vs WITHOUT PCs
PC-induced shifts identify ancestry-relevant signatures
- Signatures 5 and 15 show largest PC-induced shifts (heatmap analysis)
- This justifies focusing on signatures 5 (SAS) and 15 (EAS) for ancestry-specific analyses
- Other signatures show minimal shifts, confirming PC adjustment targets ancestry-specific variation
Implications¶
- Model captures continuous genetic variation, not discrete ancestry groups
- Ancestry probability can be used to refine predictions
- Effects are polygenic (continuous) rather than monogenic (binary)
- PC adjustment controls stratification without changing biological interpretations
Response to Reviewer¶
"We address population stratification through two complementary analyses:
Continuous ancestry effects: We analyze signature loadings as a continuous function of ancestry probability (
pred), avoiding harsh thresholding. We find that ancestry effects are continuous (e.g., Signature 5 increases continuously with SAS ancestry probability), demonstrating that signatures capture polygenic architecture rather than discrete ancestry groups.PC adjustment validation: We compare models WITH and WITHOUT PC adjustment and find that PC adjustment preserves biological signal (phi correlation >0.99, mean difference <0.002) while controlling for population stratification. Signature trajectory patterns by ancestry are preserved, confirming that PC adjustment improves model calibration without changing biological interpretations.
Signature selection rationale: PC-induced shift analysis reveals that signatures 5 and 15 show the largest shifts with PC adjustment, justifying our focus on these signatures for ancestry-specific analyses (Signature 5 for SAS ancestry, Signature 15 for EAS ancestry)."